ref

imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Review

- Create a DataFrame that holds the following data.

	att	rep	mid	fin
0	65	45	0	10
1	25	45	20	50
2	45	45	10	60
3	35	35	10	80

df = pd.DataFrame({'att':[65,25,45,35], 'rep':[45,45,45,35], 'mid':[0,20,10,10], 'fin':[10,50,60,80]})
df

	att	rep	mid	fin
0	65	45	0	10
1	25	45	20	50
2	45	45	10	60
3	35	35	10	80

df.to_csv("sample.csv",index=False)
pd.read_csv("sample.csv")

	att	rep	mid	fin
0	65	45	0	10
1	25	45	20	50
2	45	45	10	60
3	35	35	10	80

- Save this DataFrame to the file "sample.csv".

Hint: use the code below.

df.to_csv("sample.csv",index=False)

- Load the saved file back and store the result as df2.

Hint: use pd.read_csv("sample.csv").
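The save-and-reload exercise can be checked with a small round trip. The sketch below is only an illustration: it writes into a temporary directory (an assumption, so that no file is left behind) rather than the working directory used in the notes.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'att': [65, 25, 45, 35], 'rep': [45, 45, 45, 35],
                   'mid': [0, 20, 10, 10], 'fin': [10, 50, 60, 80]})

# Write to a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv")
    df.to_csv(path, index=False)   # index=False: do not write the row labels
    df2 = pd.read_csv(path)        # load the file back into a new DataFrame

# True when values, column names and dtypes all survived the round trip.
round_trip_ok = df.equals(df2)
```

Without index=False the row labels would be written as an extra unnamed column, and the reloaded frame would no longer match the original.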

pandas: Indexing, Step 1 (the four concepts of indexing)

Preparing the DataFrame

- Load the data.

df=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/dv2022.csv')
df

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20
3	55	35	35	5
4	80	60	55	70
...	...	...	...	...
195	55	70	40	95
196	65	85	25	85
197	85	85	100	10
198	80	65	35	60
199	50	95	45	85

200 rows × 4 columns

- From here on we assume a df of the form above: column names are strings and row names are integers starting from 0.

- For now we do not consider a form like the one below.

pd.DataFrame({'att':[60,65,80,90],'rep':[50,100,90,100]},index=['규빈','영미','성준','혜미'])

	att	rep
규빈	60	50
영미	65	100
성준	80	90
혜미	90	100

The four concepts of df

- Four ways to access elements: ., [], .iloc[], .loc[]

Concept 1: df as a class instance

- Concept 1: df is an instance, and it has an attribute for each column name: df.att, df.rep, df.mid, df.fin.

df.head()

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20
3	55	35	35	5
4	80	60	55	70

df.fin

0      10
1      10
2      20
3       5
4      70
       ..
195    95
196    85
197    10
198    60
199    85
Name: fin, Length: 200, dtype: int64

- When is it useful? When you only roughly remember a column name, tab completion makes it easy to select.
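Attribute access has limits that are worth seeing once. The sketch below uses a hypothetical frame named scores (not part of the lecture data) with one column whose name collides with a DataFrame method and one whose name is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical frame: an ordinary name, a name that collides with a
# DataFrame method, and a name that is not a valid Python identifier.
scores = pd.DataFrame({'att': [65, 95], 'mean': [1, 2], 'final score': [10, 20]})

att_by_attr = scores.att            # works: 'att' is a plain identifier
mean_attr = scores.mean             # the DataFrame.mean method, not the column
mean_col = scores['mean']           # brackets always return the column
final_col = scores['final score']   # a name containing a space needs brackets
```

So the dot syntax is convenient for quick exploration, while [] is the safer choice when column names are arbitrary.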

Concept 2: df as a dictionary + \(\alpha\)

- Concept 2: df can be understood as a dictionary whose keys are the column names and whose values are the column data, that is, like the dictionary dct below.

dct = dict(df) 
#dct

(Example) The .keys() method lists the column names.

dct.keys()

dict_keys(['att', 'rep', 'mid', 'fin'])

dct.keys(), df.keys()

(dict_keys(['att', 'rep', 'mid', 'fin']),
 Index(['att', 'rep', 'mid', 'fin'], dtype='object'))

`#` col indexing

- Example 1: whatever works on dct also works on df.

df['att']
#dct['att']

0      65
1      95
2      65
3      55
4      80
       ..
195    55
196    65
197    85
198    80
199    50
Name: att, Length: 200, dtype: int64

- Example 2: whatever works on dct also works on df. (2)

df.get('att')
#dct.get('att')

0      65
1      95
2      65
3      55
4      80
       ..
195    55
196    65
197    85
198    80
199    50
Name: att, Length: 200, dtype: int64

- Example 3: some things fail on dct but work on df.

dct.get(['att','rep'])

TypeError: unhashable type: 'list'

df.get(['att','rep'])

	att	rep
0	65	45
1	95	30
2	65	85
3	55	35
4	80	60
...	...	...
195	55	70
196	65	85
197	85	85
198	80	65
199	50	95

200 rows × 2 columns

- Example 4: some things fail on dct but work on df. (2)

dct[['att','rep']]

TypeError: unhashable type: 'list'

df[['att','rep']]

	att	rep
0	65	45
1	95	30
2	65	85
3	55	35
4	80	60
...	...	...
195	55	70
196	65	85
197	85	85
198	80	65
199	50	95

200 rows × 2 columns

`#` row indexing

- Example 5: some things fail on dct but work on df. (3)

dct[:5]

TypeError: unhashable type: 'slice'

df[:5]

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20
3	55	35	35	5
4	80	60	55	70
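The examples above show that [] plays two roles: a string or a list of strings selects columns, while a slice selects rows. A minimal sketch on a hypothetical three-row frame (df_small is an illustrative name) also shows the case that does not work, a single integer.

```python
import pandas as pd

df_small = pd.DataFrame({'att': [65, 95, 65], 'rep': [45, 30, 85]})

col = df_small['att']             # a string key selects one column (Series)
cols = df_small[['att', 'rep']]   # a list of keys selects columns (DataFrame)
rows = df_small[:2]               # a slice selects rows by position

# A single integer is treated as a column key, and no column is named 0.
try:
    df_small[0]
    single_int_ok = True
except KeyError:
    single_int_ok = False
```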

Quiz

Print the last column of df.
Print the last row of df.

Concept 3: df as a NumPy array

- Concept 3: df.iloc can be thought of as a NumPy array, that is, like the array arr below.

arr = np.array(df)
#arr

`#` row indexing

- Example 1: a single label

arr[0,:] # first row 
arr[0,] 
arr[0]

array([65, 45,  0, 10])

df.iloc[0,:] # first row 
df.iloc[0,] 
df.iloc[0]

att    65
rep    45
mid     0
fin    10
Name: 0, dtype: int64

- Example 2: a list of labels

arr[[0,1,2],:] # select the first 3 rows
arr[[0,1,2],] 
arr[[0,1,2]]

array([[65, 45,  0, 10],
       [95, 30, 60, 10],
       [65, 85, 15, 20]])

df.iloc[[0,1,2],:] # select the first 3 rows
df.iloc[[0,1,2],] 
df.iloc[[0,1,2]]

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20

- Example 3: slicing

arr[0:3,:] # select the first 3 rows; the endpoint is excluded
arr[0:3,] 
arr[0:3]

array([[65, 45,  0, 10],
       [95, 30, 60, 10],
       [65, 85, 15, 20]])

df.iloc[0:3,:] # select the first 3 rows; the endpoint is excluded
df.iloc[0:3,] 
df.iloc[0:3]

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20

`#` col indexing

- Example 1: a single label

df.iloc[:,0] # first column 
# arr[:,0] # first column

0      65
1      95
2      65
3      55
4      80
       ..
195    55
196    65
197    85
198    80
199    50
Name: att, Length: 200, dtype: int64

- Example 2: a list of labels

df.iloc[:,[0,2]] # select col1 and col3
# arr[:,[0,2]] # select col1 and col3

	att	mid
0	65	0
1	95	60
2	65	15
3	55	35
4	80	55
...	...	...
195	55	40
196	65	25
197	85	100
198	80	35
199	50	45

200 rows × 2 columns

- Example 3: slicing

df.iloc[:,0:3] # select the first 3 columns; the endpoint is excluded
#arr[:,0:3]

	att	rep	mid
0	65	45	0
1	95	30	60
2	65	85	15
3	55	35	35
4	80	60	55
...	...	...	...
195	55	70	40
196	65	85	25
197	85	85	100
198	80	65	35
199	50	95	45

200 rows × 3 columns

`#` row + col indexing

df.iloc[::2,:] # print the odd-numbered rows (= even indices)

	att	rep	mid	fin
0	65	45	0	10
2	65	85	15	20
4	80	60	55	70
6	65	70	60	75
8	95	55	65	90
...	...	...	...	...
190	95	35	40	95
192	100	40	80	80
194	65	40	65	70
196	65	85	25	85
198	80	65	35	60

100 rows × 4 columns
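The claim that df.iloc behaves like a NumPy array can be checked directly. The sketch below compares a few selections on a hypothetical four-row frame (df_small is an illustrative name) with the same selections on np.array(df_small).

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'att': [65, 95, 65, 55], 'rep': [45, 30, 85, 35],
                         'mid': [0, 60, 15, 35]})
arr = np.array(df_small)

# Every-other-row selection gives the same values either way.
same_rows = np.array_equal(df_small.iloc[::2, :].to_numpy(), arr[::2, :])

# A row slice combined with a list of column positions also matches.
same_block = np.array_equal(df_small.iloc[1:3, [0, 2]].to_numpy(),
                            arr[1:3, [0, 2]])

# A single (row, column) position returns the same scalar.
same_scalar = df_small.iloc[0, 1] == arr[0, 1]
```

The difference is only in what comes back: iloc keeps the row and column labels, whereas the array holds bare values.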

Quiz

Print the last column of df.
Print the last row of df.
Print the last column of the last row of df.
Print the even-numbered columns of df.

Concept 4: df as a DataFrame

- Concept 4: df.loc is something new: it indexes by label.

`#` row indexing

- Example 1: a single label

df.loc[0,:] # select the first row
df.loc[0,]
df.loc[0]

att    65
rep    45
mid     0
fin    10
Name: 0, dtype: int64

- Example 2: a list of labels

df.loc[[0,1,2],:] # select the first 3 rows
df.loc[[0,1,2],]
df.loc[[0,1,2]]

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20

- Example 3: slicing (the endpoint is included)

df.loc[0:3,:] # select the first 4 rows; the endpoint is included
df.loc[0:3,]
df.loc[0:3]

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20
3	55	35	35	5
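Because the endpoint rule is the easiest place to slip, here is a side-by-side sketch on a hypothetical five-row frame (df_small is an illustrative name). The same slice 0:3 returns a different number of rows depending on the accessor.

```python
import pandas as pd

df_small = pd.DataFrame({'att': [65, 95, 65, 55, 80],
                         'fin': [10, 10, 20, 5, 70]})

by_label = df_small.loc[0:3]      # labels 0, 1, 2, 3 (endpoint included)
by_position = df_small.iloc[0:3]  # positions 0, 1, 2 (endpoint excluded)

n_label, n_position = len(by_label), len(by_position)
```

With the default integer row names the two accessors look alike, which is why the off-by-one difference is easy to miss.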

Quiz

Print the 2nd through 5th rows. Do it once with loc and once with iloc (use slicing).

`#` col indexing

- Example 1: a single label

df.loc[:,'att']

0      65
1      95
2      65
3      55
4      80
       ..
195    55
196    65
197    85
198    80
199    50
Name: att, Length: 200, dtype: int64

- Example 2: a list of labels

df.loc[:,['att','mid']]

	att	mid
0	65	0
1	95	60
2	65	15
3	55	35
4	80	55
...	...	...
195	55	40
196	65	25
197	85	100
198	80	35
199	50	45

200 rows × 2 columns

- Example 3: slicing (the endpoint is included)

df.loc[:,'att':'mid'] # the endpoint is included

	att	rep	mid
0	65	45	0
1	95	30	60
2	65	85	15
3	55	35	35
4	80	60	55
...	...	...	...
195	55	70	40
196	65	85	25
197	85	85	100
198	80	65	35
199	50	95	45

200 rows × 3 columns

`#` row + col indexing

df.loc[::-1,'att':'mid'] # the endpoint is included

	att	rep	mid
199	50	95	45
198	80	65	35
197	85	85	100
196	65	85	25
195	55	70	40
...	...	...	...
4	80	60	55
3	55	35	35
2	65	85	15
1	95	30	60
0	65	45	0

200 rows × 3 columns

Quiz

Print the even-numbered rows of the attendance score (att).

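Summary of concepts 1~4

	`.`	`[]`	`.iloc`	`.loc`
row/single label	X	X	O	O
col/single label	O	O	O	O
row/list of labels	X	X	O	O
col/list of labels	X	O	O	O
row/slicing	X	O	O	O
col/slicing	X	X	O	O

- How much you need to know about the column names
    - . : tab completion works even if you only roughly know the first few letters
    - []: you need the exact column name
    - .loc: you usually need the exact column name, but with slicing it is enough to know the names of the two endpoint columns
    - .iloc: you can index by position without knowing the exact column name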

- A common mistake

df['att'] # works
# df.loc['att'] # does not work
df.loc[:,'att'] # works

0      65
1      95
2      65
3      55
4      80
       ..
195    55
196    65
197    85
198    80
199    50
Name: att, Length: 200, dtype: int64
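The reason for this mistake is that the first position inside .loc[] always refers to rows. A minimal sketch on a hypothetical frame (df_small is an illustrative name) makes the failure visible.

```python
import pandas as pd

df_small = pd.DataFrame({'att': [65, 95, 65], 'rep': [45, 30, 85]})

# .loc['att'] looks for a ROW labelled 'att'; the row labels are 0, 1, 2.
try:
    df_small.loc['att']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

col_a = df_small['att']          # dictionary-style column access
col_b = df_small.loc[:, 'att']   # all rows, column 'att'
```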

pandas: Indexing, Step 2 (filtering: selecting rows that match a condition)

att > 90 and rep < 50

- Method 1: use .query()

df.query('att>90 and rep<50')

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

df.query('(att>90)&(rep<50)')

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

df.query('att>90 & rep<50')

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

- Method 2: [], .iloc, .loc

(Preliminaries)

True&True, True&False, False&True, False&False

(True, False, False, False)

True|True, True|False, False|True, False|False

(True, True, True, False)

(df.att>90) & (df.rep<50)

0      False
1       True
2      False
3      False
4      False
       ...  
195    False
196    False
197    False
198    False
199    False
Length: 200, dtype: bool

End of preliminaries

df[(df.att > 90)&(df.rep < 50)]
df.loc[(df.att > 90)&(df.rep < 50)]
df.iloc[list((df.att > 90)&(df.rep < 50))]

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
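The parentheses around each comparison in the mask are not optional. In Python, & binds more tightly than > and <, so the unparenthesized form is parsed as a chained comparison and fails. The sketch below shows both cases on a hypothetical four-row frame (df_small is an illustrative name).

```python
import pandas as pd

df_small = pd.DataFrame({'att': [95, 100, 65, 95], 'rep': [30, 60, 45, 45]})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df_small.att > 90) & (df_small.rep < 50)
picked = df_small[mask]

# Without parentheses this reads as att > (90 & rep) < 50, a chained
# comparison that needs the truth value of a whole Series.
try:
    df_small[df_small.att > 90 & df_small.rep < 50]
    unparenthesized_ok = True
except ValueError:
    unparenthesized_ok = False
```

Inside .query() the string is parsed by pandas rather than by Python, which is why 'att>90 & rep<50' works there without parentheses, as shown earlier.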

- Method 3: [], .iloc, .loc // map, lambda

df.att > 90

0      False
1       True
2      False
3      False
4      False
       ...  
195    False
196    False
197    False
198    False
199    False
Name: att, Length: 200, dtype: bool

df[list(map(lambda x,y: (x>90)&(y<50), df.att, df.rep))]
# df[map(lambda x,y: (x>90)&(y<50), df.att, df.rep)] # this does not work

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

df.iloc[list(map(lambda x,y: (x>90)&(y<50), df.att, df.rep))]
df.iloc[map(lambda x,y: (x>90)&(y<50), df.att, df.rep)]

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

df.loc[list(map(lambda x,y: (x>90)&(y<50), df.att, df.rep))]
df.loc[map(lambda x,y: (x>90)&(y<50), df.att, df.rep)]

	att	rep	mid	fin
1	95	30	60	10
12	95	35	0	25
48	95	45	35	80
56	95	25	95	90
78	95	45	90	35
107	100	30	60	65
112	100	35	70	0
113	95	45	55	65
163	100	25	10	20
174	100	40	40	15
176	100	30	70	70
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80

att > mean(att)

- Method 1: use .query()

df.query('att> att.mean()')

	att	rep	mid	fin
1	95	30	60	10
4	80	60	55	70
8	95	55	65	90
9	90	25	95	50
11	95	60	25	55
...	...	...	...	...
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
197	85	85	100	10
198	80	65	35	60

95 rows × 4 columns

- Method 2: [], .iloc, .loc

df[df.att > df.att.mean()]
df.loc[df.att > df.att.mean()]
df.iloc[list(df.att > df.att.mean())]

	att	rep	mid	fin
1	95	30	60	10
4	80	60	55	70
8	95	55	65	90
9	90	25	95	50
11	95	60	25	55
...	...	...	...	...
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
197	85	85	100	10
198	80	65	35	60

95 rows × 4 columns

- Method 3: [], .iloc, .loc // map, lambda

df[list(map(lambda x: x>df.att.mean() , df.att))]
# df[map(lambda x: x>df.att.mean() , df.att)] # this does not work

	att	rep	mid	fin
1	95	30	60	10
4	80	60	55	70
8	95	55	65	90
9	90	25	95	50
11	95	60	25	55
...	...	...	...	...
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
197	85	85	100	10
198	80	65	35	60

95 rows × 4 columns

df.iloc[list(map(lambda x: x>df.att.mean() , df.att))]
df.iloc[map(lambda x: x>df.att.mean() , df.att)]

	att	rep	mid	fin
1	95	30	60	10
4	80	60	55	70
8	95	55	65	90
9	90	25	95	50
11	95	60	25	55
...	...	...	...	...
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
197	85	85	100	10
198	80	65	35	60

95 rows × 4 columns

df.loc[list(map(lambda x: x>df.att.mean() , df.att))]
df.loc[map(lambda x: x>df.att.mean() , df.att)]

	att	rep	mid	fin
1	95	30	60	10
4	80	60	55	70
8	95	55	65	90
9	90	25	95	50
11	95	60	25	55
...	...	...	...	...
184	100	30	30	85
190	95	35	40	95
192	100	40	80	80
197	85	85	100	10
198	80	65	35	60

95 rows × 4 columns

	`.`	`[]`	`.iloc`	`.loc`
row/single label	X	X	O	O
col/single label	O	O	O	O
row/list of labels	X	X	O	O
col/list of labels	X	O	O	O
row/slicing	X	O	O	O
col/slicing	X	X	O	O
row/bool,list	X	O	O	O
row/bool,ser	X	O	X	O
row/bool,map	X	X	O	O
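The X for .iloc in the row/bool,ser row can be reproduced in a few lines. The sketch below uses a hypothetical three-row frame (df_small is an illustrative name). The exact exception type raised by .iloc is an assumption that may differ between pandas versions, so both likely types are caught.

```python
import pandas as pd

df_small = pd.DataFrame({'att': [95, 100, 65], 'rep': [30, 60, 45]})
mask = df_small.att > 90          # a boolean Series that carries an index

# .iloc is purely positional and refuses a mask that has its own labels.
try:
    df_small.iloc[mask]
    iloc_accepts_series = True
except (NotImplementedError, ValueError):
    iloc_accepts_series = False

via_list = df_small.iloc[list(mask)]        # plain list of booleans
via_numpy = df_small.iloc[mask.to_numpy()]  # plain boolean array
via_loc = df_small.loc[mask]                # .loc takes the Series directly
```

Stripping the index with list(...) or .to_numpy() is what makes the mask acceptable to .iloc.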

Quiz

Create the following DataFrame.

	name	score
0	Guebin	50
1	Jaein	60
2	Daho	70
3	Seoyeon	80

Print the students whose name has at least 5 characters and whose score is at least 55.

df=pd.DataFrame({'name':['Guebin','Jaein','Daho','Seoyeon'],'score':[50,60,70,80]})
df

	name	score
0	Guebin	50
1	Jaein	60
2	Daho	70
3	Seoyeon	80

df[list(map(lambda name,score: (len(name) >= 5) & (score >= 55), df.name, df.score))]

	name	score
1	Jaein	60
3	Seoyeon	80

pandas: Indexing, Step 3 (selecting columns, with a real-world example)

Data

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
df


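Quiz: print the column names.

Basic indexing (material from Step 1 of indexing)

- We want the columns color through num_voted_users, plus aspect_ratio. -> This cannot be written as a single loc expression that mixes a slice with a label.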

df.loc[:,['color':'num_voted_users','aspect_ratio']]

SyntaxError: invalid syntax (1210972629.py, line 1)

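- (Tip) Complex selections are sometimes easier to write with iloc. \(\to\) But it is hard to tell which position each name in df.columns has. \(\to\) The code below prints the column names together with their positions.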

pd.Series(df.columns)

0                         color
1                 director_name
2        num_critic_for_reviews
3                      duration
4       director_facebook_likes
5        actor_3_facebook_likes
6                  actor_2_name
7        actor_1_facebook_likes
8                         gross
9                        genres
10                 actor_1_name
11                  movie_title
12              num_voted_users
13    cast_total_facebook_likes
14                 actor_3_name
15         facenumber_in_poster
16                plot_keywords
17              movie_imdb_link
18         num_user_for_reviews
19                     language
20                      country
21               content_rating
22                       budget
23                   title_year
24       actor_2_facebook_likes
25                   imdb_score
26                 aspect_ratio
27         movie_facebook_likes
dtype: object

list(range(13))+[26]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 26]

df.iloc[:,list(range(13))+[26]]

	color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres	actor_1_name	movie_title	num_voted_users	aspect_ratio
0	Color	James Cameron	723.0	178.0	0.0	855.0	Joel David Moore	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	CCH Pounder	Avatar	886204	1.78
1	Color	Gore Verbinski	302.0	169.0	563.0	1000.0	Orlando Bloom	40000.0	309404152.0	Action\|Adventure\|Fantasy	Johnny Depp	Pirates of the Caribbean: At World's End	471220	2.35
2	Color	Sam Mendes	602.0	148.0	0.0	161.0	Rory Kinnear	11000.0	200074175.0	Action\|Adventure\|Thriller	Christoph Waltz	Spectre	275868	2.35
3	Color	Christopher Nolan	813.0	164.0	22000.0	23000.0	Christian Bale	27000.0	448130642.0	Action\|Thriller	Tom Hardy	The Dark Knight Rises	1144337	2.35
4	NaN	Doug Walker	NaN	NaN	131.0	NaN	Rob Walker	131.0	NaN	Documentary	Doug Walker	Star Wars: Episode VII - The Force Awakens	8	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4911	Color	Scott Smith	1.0	87.0	2.0	318.0	Daphne Zuniga	637.0	NaN	Comedy\|Drama	Eric Mabius	Signed Sealed Delivered	629	NaN
4912	Color	NaN	43.0	43.0	NaN	319.0	Valorie Curry	841.0	NaN	Crime\|Drama\|Mystery\|Thriller	Natalie Zea	The Following	73839	16.00
4913	Color	Benjamin Roberds	13.0	76.0	0.0	0.0	Maxwell Moody	0.0	NaN	Drama\|Horror\|Thriller	Eva Boehnke	A Plague So Pleasant	38	NaN
4914	Color	Daniel Hsia	14.0	100.0	0.0	489.0	Daniel Henney	946.0	10443.0	Comedy\|Drama\|Romance	Alan Ruck	Shanghai Calling	1255	2.35
4915	Color	Jon Gunn	43.0	90.0	16.0	16.0	Brian Herzlinger	86.0	85222.0	Documentary	John August	My Date with Drew	4285	1.85

4916 rows × 14 columns
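Two other ways to combine a range of columns with one extra column are sketched below. They are alternatives, not the approach used in these notes: np.r_ builds the position array in one expression, and a label list can be assembled from a loc slice. The frame wide is a hypothetical stand-in for the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with six columns named a..f.
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

# np.r_ concatenates a slice and a scalar into array([0, 1, 2, 5]).
positions = np.r_[0:3, 5]
by_position = wide.iloc[:, positions]

# The same selection by label: slice 'a' through 'c', then append 'f'.
labels = list(wide.loc[:, 'a':'c'].columns) + ['f']
by_label = wide.loc[:, labels]
```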

Selecting columns whose name contains the word actor

- Check the column names again.

df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

- Method 1

df.iloc[:,list(map(lambda x : 'actor' in x, df.columns) )]

	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	actor_3_name	actor_2_facebook_likes
0	855.0	Joel David Moore	1000.0	CCH Pounder	Wes Studi	936.0
1	1000.0	Orlando Bloom	40000.0	Johnny Depp	Jack Davenport	5000.0
2	161.0	Rory Kinnear	11000.0	Christoph Waltz	Stephanie Sigman	393.0
3	23000.0	Christian Bale	27000.0	Tom Hardy	Joseph Gordon-Levitt	23000.0
4	NaN	Rob Walker	131.0	Doug Walker	NaN	12.0
...	...	...	...	...	...	...
4911	318.0	Daphne Zuniga	637.0	Eric Mabius	Crystal Lowe	470.0
4912	319.0	Valorie Curry	841.0	Natalie Zea	Sam Underwood	593.0
4913	0.0	Maxwell Moody	0.0	Eva Boehnke	David Chandler	0.0
4914	489.0	Daniel Henney	946.0	Alan Ruck	Eliza Coupe	719.0
4915	16.0	Brian Herzlinger	86.0	John August	Jon Gunn	23.0

4916 rows × 6 columns

- Method 2

df.loc[:,list(map(lambda x : 'actor' in x, df.columns) )]

	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	actor_3_name	actor_2_facebook_likes
0	855.0	Joel David Moore	1000.0	CCH Pounder	Wes Studi	936.0
1	1000.0	Orlando Bloom	40000.0	Johnny Depp	Jack Davenport	5000.0
2	161.0	Rory Kinnear	11000.0	Christoph Waltz	Stephanie Sigman	393.0
3	23000.0	Christian Bale	27000.0	Tom Hardy	Joseph Gordon-Levitt	23000.0
4	NaN	Rob Walker	131.0	Doug Walker	NaN	12.0
...	...	...	...	...	...	...
4911	318.0	Daphne Zuniga	637.0	Eric Mabius	Crystal Lowe	470.0
4912	319.0	Valorie Curry	841.0	Natalie Zea	Sam Underwood	593.0
4913	0.0	Maxwell Moody	0.0	Eva Boehnke	David Chandler	0.0
4914	489.0	Daniel Henney	946.0	Alan Ruck	Eliza Coupe	719.0
4915	16.0	Brian Herzlinger	86.0	John August	Jon Gunn	23.0

4916 rows × 6 columns

- Method 3

df.iloc[:,map(lambda x : 'actor' in x, df.columns)]

	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	actor_3_name	actor_2_facebook_likes
0	855.0	Joel David Moore	1000.0	CCH Pounder	Wes Studi	936.0
1	1000.0	Orlando Bloom	40000.0	Johnny Depp	Jack Davenport	5000.0
2	161.0	Rory Kinnear	11000.0	Christoph Waltz	Stephanie Sigman	393.0
3	23000.0	Christian Bale	27000.0	Tom Hardy	Joseph Gordon-Levitt	23000.0
4	NaN	Rob Walker	131.0	Doug Walker	NaN	12.0
...	...	...	...	...	...	...
4911	318.0	Daphne Zuniga	637.0	Eric Mabius	Crystal Lowe	470.0
4912	319.0	Valorie Curry	841.0	Natalie Zea	Sam Underwood	593.0
4913	0.0	Maxwell Moody	0.0	Eva Boehnke	David Chandler	0.0
4914	489.0	Daniel Henney	946.0	Alan Ruck	Eliza Coupe	719.0
4915	16.0	Brian Herzlinger	86.0	John August	Jon Gunn	23.0

4916 rows × 6 columns

- Method 4

df.loc[:,map(lambda x : 'actor' in x, df.columns)]

	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	actor_3_name	actor_2_facebook_likes
0	855.0	Joel David Moore	1000.0	CCH Pounder	Wes Studi	936.0
1	1000.0	Orlando Bloom	40000.0	Johnny Depp	Jack Davenport	5000.0
2	161.0	Rory Kinnear	11000.0	Christoph Waltz	Stephanie Sigman	393.0
3	23000.0	Christian Bale	27000.0	Tom Hardy	Joseph Gordon-Levitt	23000.0
4	NaN	Rob Walker	131.0	Doug Walker	NaN	12.0
...	...	...	...	...	...	...
4911	318.0	Daphne Zuniga	637.0	Eric Mabius	Crystal Lowe	470.0
4912	319.0	Valorie Curry	841.0	Natalie Zea	Sam Underwood	593.0
4913	0.0	Maxwell Moody	0.0	Eva Boehnke	David Chandler	0.0
4914	489.0	Daniel Henney	946.0	Alan Ruck	Eliza Coupe	719.0
4915	16.0	Brian Herzlinger	86.0	John August	Jon Gunn	23.0

4916 rows × 6 columns
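pandas also offers built-in shortcuts for this kind of name-based selection. The sketch below is an alternative to the map/lambda approach above, shown on a hypothetical frame (movies) that borrows only a few of the real column names.

```python
import pandas as pd

# Hypothetical one-row frame reusing a few column names from the movie data.
movies = pd.DataFrame(
    [['Color', 'James Cameron', 'CCH Pounder', 760505847.0, 936.0]],
    columns=['color', 'director_name', 'actor_1_name', 'gross',
             'actor_2_facebook_likes'],
)

# DataFrame.filter(like=...) keeps columns whose name contains the substring.
with_filter = movies.filter(like='actor')

# The vectorised string method on the column Index gives a boolean mask.
with_mask = movies.loc[:, movies.columns.str.contains('actor')]
```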

Selecting columns whose name ends with s

- Method 1

df.iloc[:,map(lambda x: 's' == x[-1],df.columns )]

	num_critic_for_reviews	director_facebook_likes	actor_3_facebook_likes	actor_1_facebook_likes	gross	genres	num_voted_users	cast_total_facebook_likes	plot_keywords	num_user_for_reviews	actor_2_facebook_likes	movie_facebook_likes
0	723.0	0.0	855.0	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	886204	4834	avatar\|future\|marine\|native\|paraplegic	3054.0	936.0	33000
1	302.0	563.0	1000.0	40000.0	309404152.0	Action\|Adventure\|Fantasy	471220	48350	goddess\|marriage ceremony\|marriage proposal\|pi...	1238.0	5000.0	0
2	602.0	0.0	161.0	11000.0	200074175.0	Action\|Adventure\|Thriller	275868	11700	bomb\|espionage\|sequel\|spy\|terrorist	994.0	393.0	85000
3	813.0	22000.0	23000.0	27000.0	448130642.0	Action\|Thriller	1144337	106759	deception\|imprisonment\|lawlessness\|police offi...	2701.0	23000.0	164000
4	NaN	131.0	NaN	131.0	NaN	Documentary	8	143	NaN	NaN	12.0	0
...	...	...	...	...	...	...	...	...	...	...	...	...
4911	1.0	2.0	318.0	637.0	NaN	Comedy\|Drama	629	2283	fraud\|postal worker\|prison\|theft\|trial	6.0	470.0	84
4912	43.0	NaN	319.0	841.0	NaN	Crime\|Drama\|Mystery\|Thriller	73839	1753	cult\|fbi\|hideout\|prison escape\|serial killer	359.0	593.0	32000
4913	13.0	0.0	0.0	0.0	NaN	Drama\|Horror\|Thriller	38	0	NaN	3.0	0.0	16
4914	14.0	0.0	489.0	946.0	10443.0	Comedy\|Drama\|Romance	1255	2386	NaN	9.0	719.0	660
4915	43.0	16.0	16.0	86.0	85222.0	Documentary	4285	163	actress name in title\|crush\|date\|four word tit...	84.0	23.0	456

4916 rows × 12 columns

- Method 2

df.loc[:,map(lambda x: 's' == x[-1],df.columns )]

	num_critic_for_reviews	director_facebook_likes	actor_3_facebook_likes	actor_1_facebook_likes	gross	genres	num_voted_users	cast_total_facebook_likes	plot_keywords	num_user_for_reviews	actor_2_facebook_likes	movie_facebook_likes
0	723.0	0.0	855.0	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	886204	4834	avatar\|future\|marine\|native\|paraplegic	3054.0	936.0	33000
1	302.0	563.0	1000.0	40000.0	309404152.0	Action\|Adventure\|Fantasy	471220	48350	goddess\|marriage ceremony\|marriage proposal\|pi...	1238.0	5000.0	0
2	602.0	0.0	161.0	11000.0	200074175.0	Action\|Adventure\|Thriller	275868	11700	bomb\|espionage\|sequel\|spy\|terrorist	994.0	393.0	85000
3	813.0	22000.0	23000.0	27000.0	448130642.0	Action\|Thriller	1144337	106759	deception\|imprisonment\|lawlessness\|police offi...	2701.0	23000.0	164000
4	NaN	131.0	NaN	131.0	NaN	Documentary	8	143	NaN	NaN	12.0	0
...	...	...	...	...	...	...	...	...	...	...	...	...
4911	1.0	2.0	318.0	637.0	NaN	Comedy\|Drama	629	2283	fraud\|postal worker\|prison\|theft\|trial	6.0	470.0	84
4912	43.0	NaN	319.0	841.0	NaN	Crime\|Drama\|Mystery\|Thriller	73839	1753	cult\|fbi\|hideout\|prison escape\|serial killer	359.0	593.0	32000
4913	13.0	0.0	0.0	0.0	NaN	Drama\|Horror\|Thriller	38	0	NaN	3.0	0.0	16
4914	14.0	0.0	489.0	946.0	10443.0	Comedy\|Drama\|Romance	1255	2386	NaN	9.0	719.0	660
4915	43.0	16.0	16.0	86.0	85222.0	Documentary	4285	163	actress name in title\|crush\|date\|four word tit...	84.0	23.0	456

4916 rows × 12 columns

Selecting columns whose name starts with c or d

- Method 1

df.iloc[:,map(lambda x: 'c' == x[0] or 'd' == x[0] ,df.columns )]

	color	director_name	duration	director_facebook_likes	cast_total_facebook_likes	country	content_rating
0	Color	James Cameron	178.0	0.0	4834	USA	PG-13
1	Color	Gore Verbinski	169.0	563.0	48350	USA	PG-13
2	Color	Sam Mendes	148.0	0.0	11700	UK	PG-13
3	Color	Christopher Nolan	164.0	22000.0	106759	USA	PG-13
4	NaN	Doug Walker	NaN	131.0	143	NaN	NaN
...	...	...	...	...	...	...	...
4911	Color	Scott Smith	87.0	2.0	2283	Canada	NaN
4912	Color	NaN	43.0	NaN	1753	USA	TV-14
4913	Color	Benjamin Roberds	76.0	0.0	0	USA	NaN
4914	Color	Daniel Hsia	100.0	0.0	2386	USA	PG-13
4915	Color	Jon Gunn	90.0	16.0	163	USA	PG

4916 rows × 7 columns

- Method 2

df.loc[:,map(lambda x: 'c' == x[0] or 'd' == x[0] ,df.columns )]

	color	director_name	duration	director_facebook_likes	cast_total_facebook_likes	country	content_rating
0	Color	James Cameron	178.0	0.0	4834	USA	PG-13
1	Color	Gore Verbinski	169.0	563.0	48350	USA	PG-13
2	Color	Sam Mendes	148.0	0.0	11700	UK	PG-13
3	Color	Christopher Nolan	164.0	22000.0	106759	USA	PG-13
4	NaN	Doug Walker	NaN	131.0	143	NaN	NaN
...	...	...	...	...	...	...	...
4911	Color	Scott Smith	87.0	2.0	2283	Canada	NaN
4912	Color	NaN	43.0	NaN	1753	USA	TV-14
4913	Color	Benjamin Roberds	76.0	0.0	0	USA	NaN
4914	Color	Daniel Hsia	100.0	0.0	2386	USA	PG-13
4915	Color	Jon Gunn	90.0	16.0	163	USA	PG

4916 rows × 7 columns
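The first-character and last-character tests can also be written with Python's own str.startswith and str.endswith, which read a little more directly than indexing with x[0] or x[-1]. This is a sketch on a hypothetical frame (movies), not the method used in the notes.

```python
import pandas as pd

# Hypothetical one-row frame reusing a few column names from the movie data.
movies = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0]],
    columns=['color', 'director_name', 'gross', 'genres', 'budget', 'country'],
)

# A list comprehension produces the boolean list that .loc accepts.
ends_with_s = movies.loc[:, [name.endswith('s') for name in movies.columns]]

# str.startswith accepts a tuple, which covers the "c or d" condition.
starts_with_c_or_d = movies.loc[
    :, [name.startswith(('c', 'd')) for name in movies.columns]
]
```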

Quiz

Print all columns whose name contains _.
How many columns have a name that contains _?

pandas: Assigning new columns, Step 1

Method 1: concat

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

_df = pd.DataFrame({'c':[3,4,5]}) 
_df

	c
0	3
1	4
2	5

pd.concat([df,_df],axis=1)

	a	b	c
0	1	2	3
1	2	3	4
2	3	4	5

Method 2: assignment following the four concepts

`#` Concept 1: not possible

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.c = pd.Series([1,2,3]) 
df

/home/cgb2/anaconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  """Entry point for launching an IPython kernel.

	a	b
0	1	2
1	2	3
2	3	4
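What happens here is that the assignment creates an ordinary Python attribute on the object instead of a column. A minimal sketch (df_small is an illustrative name; the warning is silenced only to keep the output clean):

```python
import warnings

import pandas as pd

df_small = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')       # hide the UserWarning shown above
    df_small.c = pd.Series([1, 2, 3])     # sets a plain Python attribute

has_column_c = 'c' in df_small.columns   # False: no column was created
columns_after = list(df_small.columns)
```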

`#` Concept 2: possible

(Example 1)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df['c']=[3,4,5]
df

	a	b	c
0	1	2	3
1	2	3	4
2	3	4	5

(Example 2)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df[['c','d']]=np.array([[3,4,5],[4,5,6]]).T # works, but rather roundabout
df

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

(Example 3)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df['c'],df['d']=[3,4,5],[4,5,6]
df

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

`#` Concept 3: not possible

(Example 1)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.iloc[:,2] = [3,4,5] 
df

IndexError: iloc cannot enlarge its target object

`#` Concept 4: possible

(Example 1)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.loc[:,'c'] = [3,4,5] 
df

	a	b	c
0	1	2	3
1	2	3	4
2	3	4	5

(Example 2)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.loc[:,['c','d']] = np.array([[3,4,5],[4,5,6]]).T # honestly, I did not know this worked
df

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

(Example 3)

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.loc[:,'c'],df.loc[:,'d'] = [3,4,5],[4,5,6] 
df

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

Method 3: assignment with `.assign` (\(\star\)), my favorite

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.assign(c=[3,4,5])

	a	b	c
0	1	2	3
1	2	3	4
2	3	4	5

df.assign(c=[3,4,5],d=[4,5,6])

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

df.assign(c=[3,4,5]).assign(d=[4,5,6])

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6
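One property of .assign that the outputs above do not show is that it returns a new DataFrame and leaves the original untouched. A short sketch (df_small and extended are illustrative names):

```python
import pandas as pd

df_small = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})

# .assign returns a new frame; df_small itself is not modified.
extended = df_small.assign(c=[3, 4, 5])

original_columns = list(df_small.columns)
extended_columns = list(extended.columns)
```

To keep the new column, the result has to be bound to a name, for example df_small = df_small.assign(c=[3, 4, 5]).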

Method 4: assignment with `.eval`

df = pd.DataFrame({'a':[1,2,3],'b':[2,3,4]})
df

	a	b
0	1	2
1	2	3
2	3	4

df.eval('c=[3,4,5]')

	a	b	c
0	1	2	3
1	2	3	4
2	3	4	5

df.eval('c=[3,4,5]').eval('d=[4,5,6]')

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6

Practice

`#` Create the DataFrame

df=pd.DataFrame({'x':np.random.randn(1000),'y':np.random.randn(1000)})
df

	x	y
0	-0.813856	0.635606
1	0.457182	0.334678
2	-0.473772	-1.169757
3	-0.273939	-1.044208
4	-0.619499	-0.356150
...	...	...
995	0.205837	0.422563
996	-0.058614	0.478894
997	1.874445	0.057198
998	-0.376114	1.574681
999	-0.031349	0.341959

1000 rows × 2 columns

`#` Create a new column `r` and compute \(r=\sqrt{x^2 + y^2}\)

- Method 1: broadcasting

df.assign(r=np.sqrt(df.x**2 + df.y**2))

	x	y	r
0	-0.813856	0.635606	1.032645
1	0.457182	0.334678	0.566591
2	-0.473772	-1.169757	1.262058
3	-0.273939	-1.044208	1.079542
4	-0.619499	-0.356150	0.714578
...	...	...	...
995	0.205837	0.422563	0.470030
996	-0.058614	0.478894	0.482468
997	1.874445	0.057198	1.875318
998	-0.376114	1.574681	1.618976
999	-0.031349	0.341959	0.343393

1000 rows × 3 columns

- Method 2: (quiz) compute element by element using lambda + map

- Method 3: eval

df.eval('r=sqrt(x**2+y**2)')

	x	y	r
0	-0.813856	0.635606	1.032645
1	0.457182	0.334678	0.566591
2	-0.473772	-1.169757	1.262058
3	-0.273939	-1.044208	1.079542
4	-0.619499	-0.356150	0.714578
...	...	...	...
995	0.205837	0.422563	0.470030
996	-0.058614	0.478894	0.482468
997	1.874445	0.057198	1.875318
998	-0.376114	1.574681	1.618976
999	-0.031349	0.341959	0.343393

1000 rows × 3 columns
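That the broadcasting and eval versions give the same column can be confirmed numerically. The sketch below uses a seeded generator (an assumption made for reproducibility; the notes use np.random.randn without a seed) and a smaller frame named pts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
pts = pd.DataFrame({'x': rng.standard_normal(100),
                    'y': rng.standard_normal(100)})

# Method 1: vectorised arithmetic on whole columns.
r_assign = pts.assign(r=np.sqrt(pts.x**2 + pts.y**2))

# Method 3: the same formula written as an eval expression.
r_eval = pts.eval('r = sqrt(x**2 + y**2)')

methods_agree = np.allclose(r_assign.r, r_eval.r)
```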

pandas: Assigning new columns, Step 2 (chained assignment)

Motivation

- We want to transform the data while leaving the original data intact as far as possible.

df = pd.DataFrame({'A':range(0,5),'B':range(1,6)})
df

	A	B
0	0	1
1	1	2
2	2	3
3	3	4
4	4	5

Creating a copy (naive attempt)

df2 = df 
df2

	A	B
0	0	1
1	1	2
2	2	3
3	3	4
4	4	5

df2['C'] = (df2.A+ df2.B)/2
df2

	A	B	C
0	0	1	0.5
1	1	2	1.5
2	2	3	2.5
3	3	4	3.5
4	4	5	4.5

df2['D']= (df2.C - np.mean(df2.C))/np.std(df2.C) 
df2

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

df # why did the original df change as well??

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

Solution 1: use df.copy() or .eval()

- Correct code 1

df = pd.DataFrame({'A':range(0,5),'B':range(1,6)})
df2 = df.copy() 
df2['C'] = (df2.A+ df2.B)/2
df2['D']= (df2.C - np.mean(df2.C))/np.std(df2.C)

df2

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

df

	A	B
0	0	1
1	1	2
2	2	3
3	3	4
4	4	5
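The difference between the two attempts comes down to object identity: df2 = df only gives the same object a second name, while df.copy() creates an independent object. A minimal sketch (df_orig, alias and clone are illustrative names):

```python
import pandas as pd

df_orig = pd.DataFrame({'A': range(0, 5), 'B': range(1, 6)})

alias = df_orig          # a second name for the same object
clone = df_orig.copy()   # an independent copy

alias['C'] = (alias.A + alias.B) / 2   # also visible through df_orig
clone['Z'] = 0                         # visible only through clone

same_object = alias is df_orig
clone_is_separate = clone is not df_orig
```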

- Correct code 2

df = pd.DataFrame({'A':range(0,5),'B':range(1,6)})
mean = np.mean 
std = np.std 
df.eval('C=(A+B)/2').eval('D=(C-@mean(C))/@std(C)')

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

It is not clear how much is supported inside an eval expression,
and declaring functions outside and prefixing them with @ inside the expression is a bit tedious.

- Correct code 3 (assign) -> fails

df = pd.DataFrame({'A':range(0,5),'B':range(1,6)})
df.assign(C= (df.A+df.B)/2)

	A	B	C
0	0	1	0.5
1	1	2	1.5
2	2	3	2.5
3	3	4	3.5
4	4	5	4.5

df.assign(C= (df.A+df.B)/2).assign(D= (df.C- np.mean(df.C))/np.std(df.C))

AttributeError: 'DataFrame' object has no attribute 'C'

It has to be fixed as follows.

_df = df.assign(C= (df.A+df.B)/2)
_df.assign(D= (_df.C- np.mean(_df.C))/np.std(_df.C))

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

This does not fit our philosophy, since it needs an intermediate variable.

Solution 2: chained assignment with assign

The failing code is shown below.

df.assign(C= (df.A+df.B)/2).assign(D= (df.C- np.mean(df.C))/np.std(df.C))

AttributeError: 'DataFrame' object has no attribute 'C'

In the second assign, we want the df in df.C to mean the current df (the state computed up through df.assign(C= (df.A+df.B)/2)). \(\to\) Adding lambda df: as shown below achieves this.

df.assign(C= (df.A+df.B)/2).assign(D= lambda df: (df.C- np.mean(df.C))/np.std(df.C))

	A	B	C	D
0	0	1	0.5	-1.414214
1	1	2	1.5	-0.707107
2	2	3	2.5	0.000000
3	3	4	3.5	0.707107
4	4	5	4.5	1.414214

- Chained assignment

df.assign(C = (df.A+df.B)/2).assign(D = lambda df: df.C +2).assign(E = lambda df: df.D - 2)

	A	B	C	D	E
0	0	1	0.5	2.5	0.5
1	1	2	1.5	3.5	1.5
2	2	3	2.5	4.5	2.5
3	3	4	3.5	5.5	3.5
4	4	5	4.5	6.5	4.5
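The same chain can be written with a lambda in every step, so that no step refers to the outer variable at all. The sketch below is one possible layout, not the notes' own code: the parameter name d and the method-chain formatting are assumptions, and .std(ddof=0) is used because it matches the population standard deviation that np.std computes by default.

```python
import pandas as pd

df_orig = pd.DataFrame({'A': range(0, 5), 'B': range(1, 6)})

# Each lambda receives the frame produced by the previous step.
result = (
    df_orig
    .assign(C=lambda d: (d.A + d.B) / 2)
    .assign(D=lambda d: (d.C - d.C.mean()) / d.C.std(ddof=0))
)

original_columns = list(df_orig.columns)   # the original is untouched
```

Because the lambdas never mention df_orig by name, the chain can be attached to any frame that has columns A and B.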

Quiz

Load the following DataFrame and answer the questions.

df=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/_notebooks/dv2022.csv')
df

	att	rep	mid	fin
0	65	45	0	10
1	95	30	60	10
2	65	85	15	20
3	55	35	35	5
4	80	60	55	70
...	...	...	...	...
195	55	70	40	95
196	65	85	25	85
197	85	85	100	10
198	80	65	35	60
199	50	95	45	85

200 rows × 4 columns

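(1) Print the students whose final exam score improved over their midterm score, that is, the students with mid < fin. (Practice several approaches; submitting only one implementation carries no penalty.)

# The result should look like this.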

	att	rep	mid	fin
0	65	45	0	10
2	65	85	15	20
4	80	60	55	70
5	75	40	75	85
6	65	70	60	75
...	...	...	...	...
194	65	40	65	70
195	55	70	40	95
196	65	85	25	85
198	80	65	35	60
199	50	95	45	85

93 rows × 4 columns

(2) Print the attendance and report scores of the students whose final exam score improved over their midterm score.

# The result should look like this.

	att	rep
0	65	45
2	65	85
4	80	60
5	75	40
6	65	70
...	...	...
194	65	40
195	55	70
196	65	85
198	80	65
199	50	95

93 rows × 2 columns

df = pd.DataFrame({'a':[1,2,3,4],'b':[2,3,4,5],'c':[3,4,5,6],'d':[4,5,6,7]})
df

	a	b	c	d
0	1	2	3	4
1	2	3	4	5
2	3	4	5	6
3	4	5	6	7

`2`.

Observe the results below and infer what drop does.

(Example 1)

df.drop(columns='a')

	b	c	d
0	2	3	4
1	3	4	5
2	4	5	6
3	5	6	7

(Example 2)

df.drop(columns=['a','b'])

	c	d
0	3	4
1	4	5
2	5	6
3	6	7

(Example 3)

df.drop(index=0)

	a	b	c	d
1	2	3	4	5
2	3	4	5	6
3	4	5	6	7

(Example 4)

df.drop(index=range(2,4))

	a	b	c	d
0	1	2	3	4
1	2	3	4	5

Problem: from df, drop columns a and c and drop the first row.

# The output should look like this.

	b	d
1	3	5
2	4	6
3	5	7
